privacy-filter: cap GPU memory + release cache to stop VRAM leak #51

Open: lloydmak99 wants to merge 2 commits into main from fix/privacy-filter-gpu-leak
@lloydmak99
Contributor

Problem

privacy-filter (inline HF Transformers token-classification server in small-models.yaml, pipeline(..., device_map="auto"), per-request batch_size=32) has no GPU memory bound. Under steady traffic PyTorch's CUDA caching allocator ratchets its reserved memory up and never releases it, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7).

Measured on 2026-05-29 (gpu11, H200 ~140 GB): recreating privacy-filter freed ~93 GB — for a model that needs ~1–2 GB.

Impact

As privacy-filter fills the card (free ~50 GB → ~0 over 1–2 days), the largest co-tenant Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35) can no longer load and crash-loops with torch.AcceleratorError: CUDA error: out of memory. The same leak OOM'd embeddings/whisper on 2026-05-25 ("No available memory for cache blocks"). Affects both small-models hosts (gpu11 + gpu02 — identical config).
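
The figures quoted so far can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming the approximate 140 GB card size from the measurement above; `vram_budget_gb` is a hypothetical helper name, and all numbers are the rounded values from this description rather than exact readings.

```python
def vram_budget_gb(card_gb: float, fraction: float) -> float:
    """VRAM share (in GB) that a utilization fraction claims on a card."""
    return card_gb * fraction


CARD_GB = 140  # approximate H200 capacity quoted in this PR

# Qwen3-VL runs with --gpu-memory-utilization 0.35.
qwen_gb = vram_budget_gb(CARD_GB, 0.35)

# Recreating privacy-filter freed roughly this much VRAM.
leaked_gb = 93

# Upper bound on what every other tenant had left to share.
headroom_gb = CARD_GB - leaked_gb

print(f"Qwen3-VL needs ~{qwen_gb:.0f} GB, but at most ~{headroom_gb} GB was left")
```

With roughly 93 GB held by privacy-filter, the remaining headroom is below what Qwen3-VL alone requests, which is consistent with the crash-loop described above.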

This is not a static GPU-budget misconfig of the small models, and not gemma (different GPUs): the vLLM/SGLang co-tenants hard-cap their VRAM, so the only unbounded consumer is the raw-HF privacy-filter.

How it was isolated

Recreate-and-watch (per-process nvidia-smi is unreachable — CVMs reject SSH, compose-manager has no exec): recreating FLUX freed only its ~22 GB static pool and Qwen3-VL kept crash-looping; recreating privacy-filter freed ~93 GB and Qwen3-VL recovered.

Fix

Inline server.py + container env:

  • torch.cuda.empty_cache() after every request (core fix) — returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting up.
  • torch.cuda.set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe) — hard ceiling so the process self-OOMs/restarts instead of starving its neighbours. Default GPU_MEM_FRACTION=0.10 (~14 GB on a 140 GB H200), env-tunable without an image rebuild.
  • torch.inference_mode() around inference — no autograd state retained across requests.
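
The three measures listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the embedded server.py: `configure_gpu_cap`, `run_request` and the `classify` callable (a stand-in for the HF pipeline call) are hypothetical names, and torch is imported lazily so the sketch also runs on a machine without PyTorch or a GPU.

```python
import os

# Env-tunable ceiling; 0.10 is the default named in this PR.
GPU_MEM_FRACTION = float(os.environ.get("GPU_MEM_FRACTION", "0.10"))


def configure_gpu_cap(fraction: float = GPU_MEM_FRACTION, device: int = 0) -> bool:
    """Apply the per-process ceiling once at startup.

    Returns False when torch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return True


def run_request(classify, texts):
    """Run one request; `classify` is a hypothetical stand-in for the pipeline."""
    try:
        import torch
    except ImportError:
        return classify(texts)  # no torch: nothing to guard or release
    with torch.inference_mode():  # no autograd state is recorded
        result = classify(texts)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver
    return result
```

A caller would invoke `configure_gpu_cap()` once before loading the model and route every request through `run_request`.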

Validated: small-models.yaml parses and the embedded server.py compiles.

Deploy

Normal tag + redeploy of small-models.yaml to both hosts (POST :8080/compose/up with the new tag, services:["<privacy-filter container>"], force_recreate:true).
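
The redeploy call described above might be assembled like this. It is only a sketch: the `services` and `force_recreate` fields come from the description, while the `tag` field name, the host name, the plain-HTTP scheme and the JSON content type are assumptions about the compose-manager API. The request is built but not sent.

```python
import json
import urllib.request


def compose_up_request(host: str, tag: str, service: str) -> urllib.request.Request:
    """Build (but do not send) the POST :8080/compose/up request."""
    payload = {
        "tag": tag,              # hypothetical field name for the new image tag
        "services": [service],   # from the PR description
        "force_recreate": True,  # from the PR description
    }
    return urllib.request.Request(
        f"http://{host}:8080/compose/up",  # scheme and host form are assumptions
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholders are left unfilled on purpose; substitute the real values.
req = compose_up_request("gpu11", "<new tag>", "<privacy-filter container>")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, once per host.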

⚠️ Interim: gpu11 was already mitigated by manually recreating the container (frees the leak but recurs in ~1–2 days). gpu02 still needs an immediate recreate until this ships. This PR makes the fix permanent.

Follow-up (optional)

If reserved-memory fragmentation still creeps, add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (left out here to avoid any interaction with the per-process fraction cap on torch 2.5.1).
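
If that follow-up is taken, the setting would most naturally go in the container env in small-models.yaml. For a process that sets it itself, a minimal sketch is shown here; `with_expandable_segments` is a hypothetical helper, and it assumes the usual comma-separated `option:value` format of `PYTORCH_CUDA_ALLOC_CONF`. The variable needs to be in place before PyTorch initializes its CUDA allocator, so setting it before importing torch is the conservative choice.

```python
import os


def with_expandable_segments(existing: str = "") -> str:
    """Return an allocator config string with expandable segments enabled.

    Other comma-separated option:value pairs already present are preserved.
    """
    options = [o.strip() for o in existing.split(",") if o.strip()]
    options = [o for o in options if not o.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    return ",".join(options)


# Run this before the first `import torch` in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
```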

privacy-filter is an inline HF Transformers token-classification server
(`pipeline(..., device_map="auto")`) with no memory bound. Under steady
traffic the CUDA caching allocator's reserved memory ratchets up and is
never released, so the process slowly hoards the GPU it shares with
Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7). Observed ~93 GB
held on an H200 for a model that needs ~1-2 GB.

As privacy-filter fills the card (free ~50 GB -> ~0 over 1-2 days) the
largest co-tenant, Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35),
can no longer load and crash-loops with
`torch.AcceleratorError: CUDA error: out of memory`. The same leak OOM'd
embeddings/whisper on 2026-05-25. Hits both small-models hosts (gpu11,
gpu02) since they run identical config.

Fix (inline server + container env):
- empty_cache() after every request (core fix): returns cached-but-unused
  CUDA blocks to the driver so reserved memory stops ratcheting.
- set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe): hard
  ceiling so the process self-OOMs/restarts instead of starving neighbours.
  Default 0.10 (~14 GB on a 140 GB H200), env-tunable.
- torch.inference_mode() around inference: no autograd state retained.

Interim mitigation already applied by recreating the container, which
frees the leaked VRAM but recurs in ~1-2 days; this makes it permanent.
Ship via the normal tag + compose/up redeploy of small-models.yaml.
@lloydmak99
Contributor Author

Tracking issue: nearai/infra#158

lloydmak99 requested a review from Evrard-Nil on May 29, 2026, 23:13
…_segments)

Addresses the code review of the first cut:

- Root cause now fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  lets the CUDA allocator shrink reserved segments instead of ratcheting up.
- Drop per-request torch.cuda.empty_cache(): a synchronizing cudaFree on the hot
  path stalled the shared GPU and the co-located models it was meant to protect.
  A 30s watchdog thread now releases idle blocks off the request path.
- Real fail-safe instead of a silent 500-storm: the watchdog hard-restarts the
  container (os._exit -> restart:unless-stopped) if this process's reserved VRAM
  exceeds GPU_MEM_LIMIT_GB, and an acute CUDA-OOM in a request also exits. The
  prior "self-OOMs and restarts" comment was false — a caught OOM returned 500
  while the process stayed up behind a still-healthy /v1/models probe.
- Drop set_per_process_memory_fraction: the 0.10 (~14GB) guess could OOM legit
  batch_size=32 requests, and device_map="auto" planned against the full card
  and ignored the cap anyway. Bound the work via PRIVACY_BATCH_SIZE instead;
  inputs are NOT truncated (a privacy filter must see the whole text).
- device=0 instead of device_map="auto" (no accelerate planner mismatch).
- Drop torch.inference_mode(): redundant with the pipeline's internal no_grad
  and stricter (risked raising under trust_remote_code custom models).
- Tolerant env parsing + clamps so a malformed knob can't crash-loop boot.
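
The watchdog and the acute-OOM exit described in this commit could look roughly like the sketch below. It is not the committed server.py. The function names are hypothetical, the memory reader and cache release are injected as callables so the logic runs without a GPU, and treating GPU_MEM_LIMIT_GB as GiB and releasing before checking are both assumptions.

```python
import os
import threading

GIB = 1024 ** 3  # assumption: GPU_MEM_LIMIT_GB is interpreted in GiB


def over_limit(reserved_bytes: int, limit_gb: float) -> bool:
    """True when this process's reserved VRAM exceeds the configured limit."""
    return reserved_bytes > limit_gb * GIB


def start_watchdog(read_reserved, release_idle, limit_gb,
                   interval_s=30.0, on_breach=lambda: os._exit(1)):
    """Start a daemon thread that releases idle blocks and enforces the limit.

    Returns an Event; setting it stops the watchdog.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            release_idle()  # runs off the request path
            if over_limit(read_reserved(), limit_gb):
                on_breach()  # default: hard exit so the restart policy recycles us
                return

    threading.Thread(target=loop, name="vram-watchdog", daemon=True).start()
    return stop


def run_or_exit(handler, oom_types, on_fatal=lambda: os._exit(1)):
    """Run a request handler; on an acute OOM, exit instead of returning a 500."""
    try:
        return handler()
    except oom_types:
        on_fatal()
        raise
```

With PyTorch, the wiring would plausibly be `read_reserved=lambda: torch.cuda.memory_reserved(0)`, `release_idle=torch.cuda.empty_cache` and `oom_types=(torch.cuda.OutOfMemoryError,)`. The hard `os._exit` is what lets `restart: unless-stopped` take over, since a caught exception would leave the health probe green.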

Validated: small-models.yaml parses and the embedded server.py compiles.
@lloydmak99
Contributor Author

Revised in 7295a65 to address review:

  • Per-request empty_cache() removed — it was a synchronizing cudaFree on the hot path that would stall the shared GPU and the very models this protects. A 30s watchdog thread now releases idle blocks off the request path.
  • Root cause fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets the allocator shrink reserved segments instead of ratcheting up.
  • Real fail-safe — the "self-OOMs and restarts" claim was false (a caught CUDA OOM returned 500 while the process stayed up behind a healthy /v1/models probe). Now the watchdog os._exits if reserved VRAM exceeds GPU_MEM_LIMIT_GB, and an acute request OOM also exits → restart: unless-stopped actually recycles the container.
  • Dropped set_per_process_memory_fraction — the 0.10 (~14 GB) guess could OOM legit batch_size=32 requests, and device_map="auto" planned against the full card and ignored the cap. Bound the work via PRIVACY_BATCH_SIZE instead; inputs are not truncated (a privacy filter must see the whole text).
  • device=0 instead of device_map="auto"; dropped redundant/stricter torch.inference_mode(); tolerant env parsing so a bad knob can't crash-loop boot.
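
Tolerant env parsing and batching without truncation, as summarized in the list above, might be sketched as follows. `env_int` and `batches` are hypothetical helpers; the default of 32 matches the batch size mentioned in this PR, while the 1..256 clamp range is an assumption chosen for illustration.

```python
import os


def env_int(name, default, lo, hi, environ=os.environ):
    """Read an integer knob, falling back to `default` on junk and clamping to [lo, hi]."""
    raw = environ.get(name, "")
    try:
        value = int(raw.strip())
    except (ValueError, AttributeError):
        return default  # malformed or missing knob must not crash boot
    return max(lo, min(hi, value))


def batches(texts, batch_size):
    """Yield consecutive slices of `texts`; every input is kept whole and in order."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Clamp bounds are assumptions; only the default of 32 comes from the PR.
BATCH_SIZE = env_int("PRIVACY_BATCH_SIZE", default=32, lo=1, hi=256)

for chunk in batches(["text one", "text two", "text three"], 2):
    print(len(chunk), chunk)
```

Bounding the number of texts per forward pass limits peak memory while leaving each text unshortened, which is the property a privacy filter needs.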

Validated: YAML parses and the embedded server.py compiles.
